This work compares the red and white wine datasets. Both datasets are available among the dataset options for this project.
The main question that we will try to answer is:
This report explores a dataset of red and white wines from several perspectives. The red wine dataset has information about 1,599 wines. The white wine dataset has information about 4,898 wines. Together, the two datasets have 6,497 rows and 13 variables.
Red Wines:
##        X          fixed.acidity   volatile.acidity  citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000
##  residual.sugar     chlorides       free.sulfur.dioxide  total.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00        Min.   :  6.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00        1st Qu.: 22.00
##  Median : 2.200   Median :0.07900   Median :14.00        Median : 38.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87        Mean   : 46.47
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00        3rd Qu.: 62.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00        Max.   :289.00
##     density             pH          sulphates         alcohol
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90
##     quality
##  Min.   :3.000
##  1st Qu.:5.000
##  Median :6.000
##  Mean   :5.636
##  3rd Qu.:6.000
##  Max.   :8.000
White Wines:
Comparison of both classes of wine for each attribute:
Histograms of all variables in the dataset:
Median values of all attributes for red and white wines:
Comparison of the best and worst red wines:
Comparison of the best and worst white wines:
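The comparisons listed above can be sketched in base R. This is only an illustration under assumptions: the small data frame is made up (only the column names follow the dataset), the `type` and `group` columns are hypothetical names, and "best" and "worst" are taken here to mean the highest and lowest quality score within each wine type.

```r
# Hypothetical combined data; the values are invented for illustration.
wines <- data.frame(
  type    = c("red", "red", "red", "white", "white", "white"),
  alcohol = c(9.4, 10.5, 12.8, 8.8, 10.1, 12.4),
  density = c(0.9978, 0.9968, 0.9950, 0.9980, 0.9940, 0.9900),
  quality = c(3, 5, 8, 3, 6, 9)
)

# Median of each attribute per wine type.
medians_by_type <- aggregate(cbind(alcohol, density, quality) ~ type,
                             data = wines, FUN = median)

# Within each type, label the wines with the highest and lowest quality score
# and drop everything in between.
best_worst <- do.call(rbind, lapply(split(wines, wines$type), function(d) {
  d$group <- ifelse(d$quality == max(d$quality), "best",
                    ifelse(d$quality == min(d$quality), "worst", NA))
  d[!is.na(d$group), ]
}))

# Median attribute values of the best and worst wines of each type.
medians_best_worst <- aggregate(cbind(alcohol, density) ~ type + group,
                                data = best_worst, FUN = median)

print(medians_by_type)
print(medians_best_worst)
```

Medians are used rather than means because several attributes in the summaries have long right tails, so a few extreme wines would otherwise dominate the comparison.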
The red wine dataset has information about 1,599 wines. The white wine dataset has information about 4,898 wines. Together, the two datasets have 6,497 rows and 13 variables.
The main feature in the dataset is quality. We would like to determine the best and smallest combination of features for determining the quality of a wine.
Other features that would help our analysis of both wines are: age of the wine, kind of grapes, price of the bottle, region of the wine, and whether it is a blend or not. For red wines there are visible differences when we compare high and low quality wines. It is possible to notice that alcohol, citric.acid and volatile.acidity appear to be inversely related. White wines, however, show a remarkable difference in the alcohol attribute and subtle differences in pH and density.
No new variables were created. I created a new dataset by joining the red and white wine datasets.
The datasets had to be adjusted so that they could be used with the libraries that build the graphs presented here.
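A minimal sketch of this joining step in base R, assuming both datasets are data frames with identical columns. The tiny data frames and the `type` column name are hypothetical stand-ins, not the code or data actually used for this report.

```r
# Hypothetical stand-ins for the two datasets
# (the real ones have 1,599 and 4,898 rows).
red <- data.frame(alcohol = c(9.4, 9.8, 10.5),
                  pH      = c(3.51, 3.20, 3.30),
                  quality = c(5, 5, 7))
white <- data.frame(alcohol = c(8.8, 9.5),
                    pH      = c(3.00, 3.30),
                    quality = c(6, 6))

# Tag each row with its wine type before stacking
# (the column name `type` is an assumption).
red$type   <- "red"
white$type <- "white"

# rbind() stacks data frames row-wise; both need the same column names.
wines <- rbind(red, white)
wines$type <- factor(wines$type)

nrow(wines)        # rows in red plus rows in white
table(wines$type)  # rows per wine type
```

Keeping the wine type as a factor column lets plotting and aggregation functions split the combined data by type without keeping two separate data frames around.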
Some observations:
- Alcohol: In general, both wines (red and white) have a similarly shaped distribution of alcohol content, but red wines tend to have more alcohol than white wines. An interesting point is that we found white wines with 14% alcohol and red wines with 8% alcohol;
- pH: In general, red wines have a higher pH than white wines. Two considerations apply here: 1) pH is a logarithmic scale, so a difference of one pH unit corresponds to a tenfold difference in hydrogen ion concentration, and even small differences on this scale are meaningful; 2) low pH values indicate an acidic environment, and as pH increases the environment becomes less acidic (more alkaline). pH and citric acid are inversely related, and this is confirmed in our dataset. White wines are more acidic and red wines are more alcoholic;
- Acidity:
  - Citric Acid: In general, white wines have more citric acid than red wines, which is natural given the grapes used in the process;
  - Volatile Acidity: In general, red wines have more volatile acidity than white wines;
  - Fixed Acidity: In general, red wines have more fixed acidity than white wines;
- Chlorides: In general, red wines have more chlorides than white wines, probably because of the physical-chemical production process. Both wines show high variability in this attribute;
- Density: In general, red wines are denser than white wines. This is an expected result: density helps a wine harmonize with fat, which is why red wine is commonly served with fatty meats. White wines are meant to be refreshing, and high density does not serve that purpose;
- Sulphates: In general, red wines have more sulphates than white wines, but both wines show low variability in this variable;
- Sulfur:
  - Total and Free Sulfur Dioxide: Sulfur dioxide is used to prevent oxidation and microbial growth. However, excessive amounts of SO2 can inhibit fermentation and cause undesirable sensory effects;
- Residual Sugar: In general, red wines have next to no residual sugar. White wines have more residual sugar and more variability than red wines. The distribution of this variable appears to be skewed;
- Quality: Even with different combinations of attributes, both wines arrive at similar quality.
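To make the pH remark concrete, here is a small worked example in R. It relies on the standard definition pH = -log10([H+]); the function name `h_ratio` is hypothetical, and the two pH values are the first quartile and median of red wine pH from the summary above.

```r
# Ratio of hydrogen ion concentrations implied by a pH difference.
# Since pH = -log10([H+]), the ratio [H+]_low / [H+]_high is 10^(ph_high - ph_low).
h_ratio <- function(ph_low, ph_high) 10^(ph_high - ph_low)

# Red wine first quartile (3.21) versus median (3.31):
# a difference of 0.1 pH units is roughly a 26% difference in [H+].
h_ratio(3.21, 3.31)  # about 1.26

# A full pH unit corresponds to a tenfold difference.
h_ratio(3, 4)        # 10
```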
Conclusions:
This graph shows us many interesting things:
All Wines:
Red Wines: We are interested in understanding the behavior of quality over other variables considering just red wines.
White Wines:
Now we are interested in understanding the behavior of quality over other variables considering just white wines.
All Wines: The differences between the best and worst wines are subtle for both types of wine (red and white). The data suggest that more alcohol and citric acid, combined with lower density and fewer chlorides, are associated with the highest quality in both types of wine. This evidence agrees with oenology theory, in which good wines have a good balance between alcohol, density and citric acid. Maybe chlorides and sulphates are substances added during production to balance the wine.
Red Wines: If we consider only red wines, maximum quality is obtained when:
The best red wines have more alcohol, more citric acid, more sulphates, lower pH, lower density and fewer chlorides. These attributes show the contrast between the best and worst red wines.
White Wines: If we consider only white wines, maximum quality is obtained when:
The best white wines have more alcohol, more citric acid, more sulphates, lower density and fewer chlorides than the worst white wines. However, they also have a higher pH than the worst white wines, whereas the best red wines have a lower pH than the worst red wines.
General Conclusions:
* For both types of wine, the variables most related to quality are alcohol, density and citric acid. This evidence agrees with oenology theory, in which good wines have a good balance between alcohol, density and citric acid;
Outliers Analysis: There is a white wine with maximum quality and a low percentage of alcohol. This is an interesting outlier to analyze. In this particular case, the low alcohol content is associated with higher values of residual sugar, fixed acidity and density, perhaps to give this wine the balance it needs.
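One way to locate such cases is sketched below in base R on made-up values. The data frame is hypothetical, and using the median alcohol content as the cutoff for "low alcohol" is an assumption, not the criterion used in this analysis.

```r
# Hypothetical sample; the values are invented for illustration.
wines <- data.frame(
  alcohol        = c(9.0, 10.4, 11.0, 12.5, 12.9),
  residual.sugar = c(10.6, 2.0, 1.6, 4.2, 1.9),
  quality        = c(9, 6, 5, 9, 9)
)

# Wines that reach the maximum quality score.
top <- wines[wines$quality == max(wines$quality), ]

# Among those, keep the ones below the median alcohol content of all wines.
cutoff <- median(wines$alcohol)
low_alcohol_top <- top[top$alcohol < cutoff, ]

print(low_alcohol_top)
```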
At this point we can see that alcohol and density are related to quality for all wines, but acidity is not. We also tried to relate the acidity variables to alcohol and density, without success. After that, we tried to build a function that relates alcohol and density to explain quality. We built three models to explain quality:
* 1) Geometric mean of alcohol and density;
\[ f_1(a,d) = \sqrt{a \cdot d} \]
* 2) Geometric mean of alcohol, density and citric acid;
\[ f_2(a,d,c) = \sqrt[3]{a \cdot d \cdot c} \]
* 3) Ratio of alcohol to density;
\[ f_3(a,d) = \frac{a}{d} \]
The first model treats quality as a balance between alcohol and density. The second model treats quality as a balance between alcohol, density and acidity (which is closer to reality). The third model treats quality as a ratio of alcohol to density. We check the correlation of each model's output with the quality values to measure the model's ability to explain the quality variable.
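The three models and the correlation check can be sketched as follows. This is an illustration on a made-up sample, not the code behind the results reported here: the function names are hypothetical, only the column names follow the dataset, and Pearson correlation (the default of `cor()`) is assumed.

```r
# Hypothetical sample; the values are invented for illustration.
wines <- data.frame(
  alcohol     = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  density     = c(0.9978, 0.9970, 0.9962, 0.9951, 0.9940, 0.9925),
  citric.acid = c(0.00, 0.10, 0.26, 0.30, 0.42, 0.45),
  quality     = c(5, 5, 6, 6, 7, 8)
)

# Model 1: geometric mean of alcohol and density.
geo_mean_2 <- function(a, d) sqrt(a * d)
# Model 2: geometric mean of alcohol, density and citric acid.
geo_mean_3 <- function(a, d, cit) (a * d * cit)^(1 / 3)
# Model 3: ratio of alcohol to density.
alcohol_density_ratio <- function(a, d) a / d

scores <- data.frame(
  model1 = geo_mean_2(wines$alcohol, wines$density),
  model2 = geo_mean_3(wines$alcohol, wines$density, wines$citric.acid),
  model3 = alcohol_density_ratio(wines$alcohol, wines$density)
)

# Correlation of each model's output with quality.
correlations <- sapply(scores, cor, y = wines$quality)
print(correlations)
```

One property of the second model is worth keeping in mind: the minimum citric.acid in the red wine summary is 0, and for such wines the geometric mean is 0 regardless of alcohol and density.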
Maybe we need more information about the wines, such as year of production, grapes used, place of production (terroir) and other variables related to taste. We know that alcohol is able to explain about 45% of the quality variable. Perhaps by combining alcohol with other variables we could determine quality as well as a human evaluator does.